(57) Abstract

A method for recognizing a pattern that comprises a set of physical stimuli, said method comprising the steps of: providing a set
of training observations and through applying a plurality of association models ascertaining various measuring values p_j(k | x), j = 1...M,
that each pertain to assigning a particular training observation to one or more associated pattern classes; setting up a log/linear association 
distribution by combining all association models of the plurality according to respective weight factors, and joining thereto a normalization 
quantity to produce a compound association distribution; optimizing said weight factors for thereby minimizing a detected error rate of the 
actual assigning to said compound distribution; recognizing target observations representing a target pattern with the help of said compound 
distribution. 

Method of determining model-specific factors for pattern recognition, in particular for speech 
patterns. 



BACKGROUND OF THE INVENTION 

The invention relates to a method for recognizing a pattern that comprises 
a set of physical stimuli, said method comprising the steps of: 

- providing a set of training observations and through applying a plurality of association
models ascertaining various measuring values p_j(k | x), j = 1...M, that each pertain to
assigning a particular training observation to one or more associated pattern classes;

- setting up a log/linear association distribution by combining all association models of the 
plurality according to respective weight factors, and joining thereto a normalization quantity 
to produce a compound association distribution. 

The invention has been conceived for speech recognition, but is likewise
applicable to other recognition processes, such as speech understanding, speech
translation, and the recognition of handwriting, faces, scenes, and other
environments. The association models may be probability models that give probability
distributions for assigning patterns to classes. Other models can be based on fuzzy logic, or on
similarity measures, such as distances measured between target and class. Known technology
has used different such models in a combined recognition approach, but the influences lent to
the various cooperating models were determined in a haphazard manner. This meant that
only few and/or only elementary models were feasible.

The present inventor has recognized that unifying the Maximum-Entropy
and Discriminative Training principles would, when more than one model is combined,
in principle be able to attain results superior to those of earlier heuristic
methods. Also, a straightforward data processing procedure should provide a cheap and fast
road to those results.

In consequence, amongst other things, it is an object of the invention to
evaluate a log-linear combination of various submodels p_j(k | x) whilst executing parameter
evaluation through discriminative training. Now, according to one of its aspects, the
invention attains the object by a method for recognizing a pattern that comprises a set of
physical stimuli, said method comprising the steps of:

- providing a set of training observations and through applying a plurality of association
models ascertaining various measuring values p_j(k | x), j = 1...M, that each pertain to
assigning a particular training observation to one or more associated pattern classes;

- setting up a log/linear association distribution by combining all association models of the
plurality according to respective weight factors, and joining thereto a normalization quantity
to produce a compound association distribution;

- optimizing said weight factors for thereby minimizing a detected error rate of the actual
assigning to said compound distribution;

- recognizing target observations representing a target pattern with the help of said compound
distribution.

Inter alia, such a procedure allows any number of models to be combined into a single
maximum-entropy distribution. Furthermore, it allows an optimized interaction of models
that may vary widely in character and representation.

The invention also relates to a method for modelling an association
distribution according to the invention. This provides an excellent tool for subsequent users
of the compound distribution for recognizing appropriate patterns.

The invention also relates to a method for recognizing patterns using a
compound distribution produced by the invention. Users of this method benefit to a
great deal from applying the tool realized by the invention.

The invention also relates to a system that is arranged for practising a method
according to the invention. Further aspects are recited in dependent Claims.


BRIEF DESCRIPTION OF THE DRAWING 

These and other aspects and advantages of the invention will be discussed
in more detail with reference to the detailed disclosure of preferred embodiments hereinafter,
and in particular with reference to the appended Figures that show:

Fig. 1, an overall flow chart of the method;

Fig. 2, a comprehensive system for practising the invention;

Figures 3-21 give various equations B1-B20 used with the automatic
method according to the invention.

DETAILED DISCLOSURE OF PREFERRED EMBODIMENTS

The invention being based on a balanced application of mathematics on
the handling and accommodating of physical quantities that may be of very diverse character,
much of the disclosure is based on advanced mathematics. However, both the starting point
and the eventual outcome have permanently physical aspects and relevance. The speech
recognition may be used to control various types of machinery. Scene analysis may guide
unmanned vehicles. Picture recognition may be used for gate control. Various other
applications are evident per se. The expressions hereinafter are numbered in sequence, and
will be referred to in the text by these numbers.

The invention determines model-specific factors in order to combine and
optimize several different models into a single pattern recognition process, notably for speech
recognition.

The statistical speech recognition method utilizes Bayes' decision theory in
order to form an identification mechanism with a minimum error rate. In conformity with
this theory, the decision rule is such that an observation x must be assigned to the class k
(x ∈ k for brevity) when, for a given a posteriori or "real" probability distribution π(k|x), it
holds that:

$$\forall k' = 1,\dots,K;\ k' \neq k:\quad \log\frac{\pi(k \mid x)}{\pi(k' \mid x)} > 0 \;\Rightarrow\; x \in k \qquad (1)$$
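As a minimal illustration of decision rule (1), the following Python sketch assigns an observation to the class whose discriminant function against every rival is positive. The function names and the toy posterior values are assumptions made for illustration only; they do not appear in the disclosure.

```python
import math

def discriminant(posterior, k, k_rival):
    """g(x, k, k') = log(pi(k|x) / pi(k'|x)) for one fixed observation x."""
    return math.log(posterior[k] / posterior[k_rival])

def bayes_decide(posterior):
    """Return the class k whose discriminant against every rival is positive.

    `posterior` maps each class to pi(k|x); all values are assumed to be > 0.
    Returns None when no class strictly beats all rivals (a tie).
    """
    for k in posterior:
        if all(discriminant(posterior, k, r) > 0 for r in posterior if r != k):
            return k
    return None

# Hypothetical posterior values for a single observation x.
toy_posterior = {"yes": 0.7, "no": 0.2, "maybe": 0.1}
decision = bayes_decide(toy_posterior)
```

In practice the true posterior is unknown, which is why the disclosure goes on to replace it by a model distribution.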



In literature, the term log(π(k|x)/π(k'|x)) is called the discriminant
function. Hereinafter, this term will be denoted g(x,k,k') for brevity. When the
decision rule (1) is used for recognizing complete sentences, observed expressions
x_1^T = (x_1,...,x_T) that have a temporal length T will be classified as spoken word sequences
w_1^S = (w_1,...,w_S) of length S. The a posteriori distribution π(w_1^S | x_1^T) is however unknown,
since it describes the complicated natural speech communication process of humans.
Consequently, it must be approximated by a distribution p(w_1^S | x_1^T). Thus far, the
acoustic-phonetic and grammatical modelling of speech in the form of parametric probability
distributions has attained the best results. The form of the distribution p(w_1^S | x_1^T) is then
predetermined; the unknown parameters of the distribution are estimated on the basis of
training data. The distribution p(w_1^S | x_1^T) so acquired is subsequently inserted into Bayes'
decision rule. The expression x_1^T is then assigned to the word sequence w_1^S for which:

$$\log\frac{p(w_1^S \mid x_1^T)}{p({w'}_1^{S'} \mid x_1^T)} > 0 \qquad (2)$$



Conversion of the discriminant function

$$g(x_1^T, w_1^S, {w'}_1^{S'}) = \log\frac{p(w_1^S)\,p(x_1^T \mid w_1^S)}{p({w'}_1^{S'})\,p(x_1^T \mid {w'}_1^{S'})} \qquad (3)$$



allows the grammatical model p(w_1^S) to be separated from the acoustic-phonetic model p(x_1^T | w_1^S)
in a natural way. The grammatical model p(w_1^S) then describes the probability of occurrence
of the word sequence w_1^S per se, and the acoustic-phonetic model p(x_1^T | w_1^S) evaluates the
probability of occurrence of the acoustic signal x_1^T during the uttering of the word sequence
w_1^S. Both models can then be estimated separately, so that an optimum use can be made of
the relatively limited amount of training data. The decision rule (3) could be less than
optimum due to a deviation of the distribution p from the unknown distribution π, even
though the estimation of the distribution p was optimum. This fact motivates the use of
so-called discriminative methods. Discriminative methods optimize the distribution p directly in
respect of the error rate of the decision rule as measured empirically on training data. The
simplest example of such discriminative optimization is the use of the so-called language
model factor λ. The equation (3) is then modified as follows:



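$$g(x_1^T, w_1^S, {w'}_1^{S'}) = \log\frac{p(w_1^S)^{\lambda}\,p(x_1^T \mid w_1^S)}{p({w'}_1^{S'})^{\lambda}\,p(x_1^T \mid {w'}_1^{S'})} \qquad (4)$$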



WO 99/31654 PCT/IB98/01990 


Experiments show that the error rate incurred by the decision rule (4) decreases when
choosing λ > 1. The reason for this deviation from theory, wherein λ = 1, lies in the
incomplete or incorrect modelling of the probability of the compound event (w_1^S, x_1^T). The
latter is inevitable, since the knowledge of the process generating the event (w_1^S, x_1^T) is
incomplete.
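The effect of the language model factor can be sketched in the log domain. The Python fragment below evaluates the discriminant function with a language model factor for one hypothesis and one rival; the probability values are hypothetical and were chosen only so that the sign of the discriminant changes between λ = 1 and λ = 2.

```python
import math

def scaled_score(lm_prob, ac_prob, lam):
    """log(p(w)**lam * p(x|w)): numerator or denominator of the discriminant."""
    return lam * math.log(lm_prob) + math.log(ac_prob)

def discriminant_with_factor(hyp, rival, lam):
    """g = log[p(w)^lam p(x|w)] - log[p(w')^lam p(x|w')]."""
    return scaled_score(*hyp, lam) - scaled_score(*rival, lam)

# Hypothetical (language-model probability, acoustic likelihood) pairs.
hyp = (1e-3, 1e-20)    # plausible wording, weaker acoustic match
rival = (1e-5, 1e-17)  # unlikely wording, stronger acoustic match

g_theory = discriminant_with_factor(hyp, rival, lam=1.0)  # negative: rival wins
g_scaled = discriminant_with_factor(hyp, rival, lam=2.0)  # positive: hypothesis wins
```

With these assumed numbers the acoustic score dominates at λ = 1, whereas a larger factor gives the grammatical model enough influence to reverse the decision.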

Many acoustic-phonetic and grammatical language models have been
analyzed thus far. The object of these analyses was to find the "best" model for the relevant
recognition task out of the set of known or given models. All models determined in this
manner are however imperfect representations of the real probability distribution, so that
when these models are used for pattern recognition, such as speech recognition, incorrect
recognitions occur as incorrect assignments to classes.

It is an object of the invention to provide a modelling, notably for speech,
which approximates the real probability distribution more closely and which nevertheless can
be carried out while applying only little processing effort, and in particular to allow easy
integration of a higher number of known or given models into a single classifier mechanism.



SUMMARY OF THE INVENTION 

The novel aspect of the approach is that it does not attempt to integrate
known speech properties into a single acoustic-phonetic distribution model and into a single
grammatical distribution model, which would involve complex and difficult training. The
various acoustic-phonetic and grammatical properties are now modeled separately and trained
in the form of various distributions p_j(w_1^S | x_1^T), j = 1...M, followed by integration into a
compound distribution

$$p_{\{\Lambda\}}(w_1^S \mid x_1^T) = C(\Lambda)\prod_{j=1}^{M} p_j(w_1^S \mid x_1^T)^{\lambda_j} = \exp\Big\{\log C(\Lambda) + \sum_{j=1}^{M} \lambda_j \log p_j(w_1^S \mid x_1^T)\Big\} \qquad (5)$$

The effect of the model p_j on the distribution p_{Λ} is determined by the associated coefficient λ_j.

The factor C(Λ) ensures that the normalization condition for probabilities
is satisfied. The free coefficients Λ = (λ_1,...,λ_M)^tr are adjusted so that the error rate of the
resultant discriminant function

$$g(x_1^T, w_1^S, {w'}_1^{S'}) = \log\frac{p_{\{\Lambda\}}(w_1^S \mid x_1^T)}{p_{\{\Lambda\}}({w'}_1^{S'} \mid x_1^T)} \qquad (6)$$

is as low as possible. There are various possibilities for implementing this basic idea,
several of which will be described in detail hereinafter.
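A short worked step may clarify why the resultant discriminant function is linear in the coefficients. It assumes that the normalization factor C(Λ) depends on the observation but not on the class, as a normalization over all classes implies; the class symbols k and k' are used here as shorthand for a word sequence and its rival.

```latex
\begin{aligned}
\log\frac{p_{\{\Lambda\}}(k \mid x)}{p_{\{\Lambda\}}(k' \mid x)}
  &= \Big[\log C(\Lambda) + \sum_{j=1}^{M} \lambda_j \log p_j(k \mid x)\Big]
   - \Big[\log C(\Lambda) + \sum_{j=1}^{M} \lambda_j \log p_j(k' \mid x)\Big] \\
  &= \sum_{j=1}^{M} \lambda_j \log\frac{p_j(k \mid x)}{p_j(k' \mid x)}
\end{aligned}
```

Under this assumption the normalization cancels, so the compound discriminant function is a weighted sum of the discriminant functions of the individual submodels.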

First of all, various terms used herein will be defined. Each word
sequence w_1^S forms a class k; the sequence length S may vary from one class to another. A
speech utterance x_1^T is considered as an observation x; its length T may then differ from one
observation to another.

Training data is denoted by the references (x_n, k), with n = 1,...,N; k =
0,...,K. Herein N is the number of acoustic training observations x_n, and k_n is the correct
class associated with the observation x_n. Further, k ≠ k_n are the various incorrect rival
classes that compete with respect to k_n.

The classification of the observation x into the class k in conformity with
Bayes' decision rule (1) will be considered. The observation x is an acoustic realization of
the class k. In the case of speech recognition, each class k symbolizes a sequence of words.
However, the method can be applied more generally.

Because the class k_n produced by the training observation x_n is known, an
ideal empirical distribution π̂(k|x) can be constructed on the basis of the training data (x_n,
k); n = 1...N; k = 0...K. This distribution should be such that the decision rule derived
therefrom has a minimum error rate when applied to the training data. In the case of
classification of complete word sequences k, a classification error through selecting an
erroneous word sequence k ≠ k_n may lead to several word errors. The number of word
errors between the incorrect class k and the correct class k_n is called the Levenshtein distance
E(k,k_n). The decision rule formed from E(k,k_n) has a minimum word error rate when a
monotony property is satisfied.

The ideal empirical distribution π̂ is a function of the empirical error
value E(k,k_n), which is given only for the training data but is not defined with respect to
unknown test data, because the correct class assignment is not given in that case. Therefore,
on the basis of this distribution, there is sought a distribution

$$p_{\{\Lambda\}}(k \mid x) = \frac{\exp\Big\{\sum_{j=1}^{M} \lambda_j \log p_j(k \mid x)\Big\}}{\sum_{k'} \exp\Big\{\sum_{j=1}^{M} \lambda_j \log p_j(k' \mid x)\Big\}} \qquad (7)$$



which is defined over arbitrary, independent test data and has as low as possible an empirical
error rate on the training data. If the M predetermined distribution models
p_1(k|x),...,p_M(k|x) are defined on arbitrary test data, the foregoing also holds for the
distribution p_{Λ}(k|x). When the freely selectable coefficients Λ = (λ_1,...,λ_M)^tr are
determined in such a manner that p_{Λ}(k|x) on the training data has a minimum error rate,
and if the training data is representative, p_{Λ}(k|x) should yield an optimum decision rule
also on independent test data.
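The log-linear combination with its normalization can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation: the function name, the dictionary layout and the toy scores are assumptions, and the division by the sum over all classes plays the role of the normalization factor C(Λ).

```python
import math

def log_linear_posterior(sub_probs, weights):
    """Combine M submodel scores p_j(k|x) into one normalized distribution.

    sub_probs: dict mapping class k -> list of M probabilities p_j(k|x)
    weights:   list of M coefficients lambda_j
    Returns a dict mapping class k -> combined posterior value.
    """
    # Weighted sum of log scores per class: sum_j lambda_j * log p_j(k|x).
    log_scores = {
        k: sum(w * math.log(p) for w, p in zip(weights, probs))
        for k, probs in sub_probs.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnorm = {k: math.exp(s - top) for k, s in log_scores.items()}
    # Normalize over all classes so that the result sums to one.
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

# Hypothetical scores of two submodels for three classes.
toy = {"k1": [0.6, 0.3], "k2": [0.3, 0.5], "k3": [0.1, 0.2]}
posterior = log_linear_posterior(toy, weights=[1.0, 0.5])
best = max(posterior, key=posterior.get)
```

Setting a coefficient to zero removes the corresponding submodel from the combination; setting all coefficients to zero yields a uniform distribution.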

The GPD (Generalized Probabilistic Descent) method as well as the least mean square method optimize a
criterion which approximates the mean error rate of the classifier. In comparison with the
GPD method, the least mean square method offers the advantage that it yields a closed
solution for the optimum coefficient vector Λ.

The least mean square method will first be considered. Because the
discriminant function (1) determines the quality of the classifier, the coefficients Λ should
minimize the root mean square deviation B14 of the discriminant functions of the
distributions p_{Λ}(k|x) from the empirical error rate E(k,k_n). The summing over r then
includes all rival classes in the criterion. Minimizing D(Λ) leads to a closed solution for the
optimum coefficient vector Λ = Q^{-1} P (9), further detailed by B15 and B16.
Herein, Q is the autocorrelation matrix of the discriminant functions of the predetermined
distribution models. The vector P expresses the relationship between the discriminant
functions of the predetermined models and the discriminant function of the distribution π̂.
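The closed solution can be illustrated with a small, dependency-free Python sketch. Expressions B14 to B16 are not reproduced in this text, so the sums used for Q and P below are an assumed, unnormalized form of an autocorrelation matrix and a correlation vector; the function names and the toy data are likewise hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("Q is singular; submodels may be redundant")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def least_squares_weights(rows):
    """rows: list of (g, e) pairs, one per (training sample, rival class).

    g holds the M submodel discriminant values, e the error value of the rival.
    Assumed form: Q[i][j] = sum g_i * g_j and P[i] = sum g_i * e.
    """
    m = len(rows[0][0])
    q = [[sum(g[i] * g[j] for g, _ in rows) for j in range(m)] for i in range(m)]
    p = [sum(g[i] * e for g, e in rows) for i in range(m)]
    return solve_linear(q, p)

# Toy data whose targets were generated exactly as e = 2.0 * g_1 + 0.5 * g_2.
rows = [([1.0, 0.0], 2.0), ([0.0, 1.0], 0.5), ([1.0, 1.0], 2.5), ([2.0, -1.0], 3.5)]
weights = least_squares_weights(rows)
```

Because the toy targets are an exact linear function of the discriminant values, the solution recovers the generating coefficients; with real data the result is the least squares compromise.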



The word error rate E(k,k_n) of the hypotheses k is thus linearly taken up in the coefficients
λ_1,...,λ_M. Conversely, the discrimination capacity of the distribution model p_i is linearly
included in the coefficients λ_1,...,λ_M for determining the coefficients directly via the
discriminant function

$$\log\frac{p_i(k \mid x_n)}{p_i(k_n \mid x_n)}$$



Alternatively, these coefficients can be determined by using the GPD
method. With this method, the smoothed empirical error rate E(Λ):

$$E(\Lambda) = \frac{1}{N}\sum_{n=1}^{N} l(x_n, k_n, \Lambda) \qquad (12)$$

$$l(x_n, k_n, \Lambda) = \Bigg[1 + A\Bigg(\frac{1}{K}\sum_{k \neq k_n} \exp\Big\{-\eta \log\frac{p_{\{\Lambda\}}(k_n \mid x_n)}{p_{\{\Lambda\}}(k \mid x_n)}\Big\}\Bigg)^{-B/\eta}\Bigg]^{-1} \qquad (13)$$

can be directly minimized for the training data. The left hand expression is then a smoothed
measure for the error classification risk of the observation x_n. The values A>0, B>0, η>0
determine the type of smoothing of the error classification risk and should be suitably
predetermined. When E(Λ) is minimized in respect of the coefficients λ of the log-linear
combination, the following iteration equation with the step width M is obtained for the
coefficients λ_j, wherein j = 1,...,M



λ_j = 1 (11), and furthermore according to B13 and B14, and

(λ_1,...,λ_M), j = 1,...,M






It is to be noted that the coefficient vector Λ is included in the criterion E(Λ) by way of the
discriminant function

$$\log\frac{p_{\{\Lambda\}}(k_n \mid x_n)}{p_{\{\Lambda\}}(k \mid x_n)} \qquad (12)$$

If E(Λ) decreases, the discriminant function (12) should increase on average because of (9)
and (10). This results in a further improved decision rule, see (1).
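The iterative principle can be sketched in Python as plain gradient descent on a smoothed error. This is a deliberately simplified stand-in, not expressions B4 to B13: it assumes a logistic smoothing, a single rival per training sample, and a fixed step width, and all names and numbers are hypothetical.

```python
import math

def discriminant(weights, g):
    """Weighted sum of submodel discriminant values for one (sample, rival) pair."""
    return sum(w * gj for w, gj in zip(weights, g))

def smoothed_error(weights, pairs):
    """Mean logistic loss; close to 0 when the correct class wins clearly."""
    losses = [1.0 / (1.0 + math.exp(discriminant(weights, g))) for g in pairs]
    return sum(losses) / len(losses)

def descent_step(weights, pairs, step):
    """One gradient step on the smoothed error."""
    grad = [0.0] * len(weights)
    for g in pairs:
        loss = 1.0 / (1.0 + math.exp(discriminant(weights, g)))
        for j, gj in enumerate(g):
            # d loss / d lambda_j = -loss * (1 - loss) * g_j
            grad[j] -= loss * (1.0 - loss) * gj / len(pairs)
    return [w - step * gr for w, gr in zip(weights, grad)]

# Hypothetical values g_j = log(p_j(k_n|x_n) / p_j(k|x_n)) for two submodels.
pairs = [[2.0, -1.0], [1.5, -0.5], [1.0, 0.5]]
weights = [1.0, 1.0]  # start from the theoretical value of one for each factor
error_before = smoothed_error(weights, pairs)
for _ in range(50):
    weights = descent_step(weights, pairs, step=0.5)
error_after = smoothed_error(weights, pairs)
```

In this toy setting the first submodel favours the correct class in every pair, so its coefficient grows, and the smoothed error falls as the weighted discriminant values increase.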

In the above, the aim has been to integrate all available knowledge
sources into a single pattern recognition system. Two principles are united. The first is the
maximum-entropy principle. This works by introducing as few assumptions as possible, so
that uncertainty is maximized. Thus, exponential distributions must be used. In this manner
the structure of the sources combination is defined. The second principle is discriminative
training, to determine the weighting factors assigned to the various knowledge sources, and
the associated models. Through optimizing the parameters, the errors are minimized. For
speech, models may be semantic, syntactic, acoustic, and others.

The approach is the log-linear combining of various submodels and the
estimating of parameters through discriminative training. In this manner, the adding of a
submodel may improve the recognition score. If not, the model in question may be
discarded. Because a weight factor of zero removes a submodel from the combination, an
added submodel can never decrease the recognition accuracy attained in training. In this manner,
all available submodels may be combined to yield optimum results. Another application of
the invention is to adapt an existing model combination to a new recognition environment.

The theoretical approach of the procedure includes various aspects:

- parabolic smoothing of the empirical error rate

- simplifying the theory of "minimum error rate training"

- providing a closed form solution that needs no iteration sequence.

The invention furthermore provides extra facilities:

- estimating an optimum language model factor

- applying a log-linear Hidden Markov Model

- closed form equations for optimum model combination

- closed form equations for discriminative training of class-specific probability distributions.

Now for the classification task specified in (1), the true or posterior
distribution π(k | x) is unknown but approximated by a model distribution p(k | x). The two
distributions differ, because of incorrect modelling assumptions and because of insufficient
data. An example of compensating for this difference is the language model factor λ used in equation B1.

The formal definition combines various submodels p_j(k | x), j = 1...M into
a log-linear posterior distribution p_{Λ}(k | x) = exp{...} as given in (5). Next to the
log-linear combination of the various submodels, the term log C(Λ) allows normalization to attain
a formal probability distribution. The resulting discriminant function is

$$\log\frac{p_{\{\Lambda\}}(k \mid x)}{p_{\{\Lambda\}}(k' \mid x)} = \sum_{j} \lambda_j \log\frac{p_j(k \mid x)}{p_j(k' \mid x)} \qquad (B2)$$

The error rate is minimized and Λ optimized. Optimizing on the sentence level is as follows:

• Class k: word sequence

• Observation x: spoken utterance (e.g. sentence)

• N training samples x_n, giving the correct sentence

• For each sample x_n

- k_n: correct class as spoken

- k ≠ k_n: rival classes, which may be all possible sentences, or for example, a
reasonable subset thereof.

• Similarity of classes: E(k_n,k)

- E: suitable function of Levenshtein distance, or a similarly suitable measure that is
monotonic.

• Number of words in word sequence k_n: L_n.
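The similarity measure between a correct sentence and a rival can be illustrated with a word-level Levenshtein distance. The Python sketch below is a standard dynamic-programming formulation; the function name and the example sentences are assumptions, and the disclosure leaves open which function of this distance is actually used for E.

```python
def word_errors(reference, hypothesis):
    """Levenshtein distance between two word sequences (lists of words).

    Counts the minimum number of substitutions, insertions and deletions
    needed to turn `reference` into `hypothesis`.
    """
    # prev[j] holds the distance between the processed reference prefix
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]

# Hypothetical correct class k_n and rival class k.
correct = "open the front gate".split()
rival = "open a front gate now".split()
errors = word_errors(correct, rival)  # one substitution plus one insertion
```

A rival that differs in a single word thus receives a smaller error value than one that differs in several, which is the graded information exploited when weighting the rivals.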

Now, equation B3 gives an objective function, the empirical error rate.
Herein, the left hand side of the equation introduces the most probable class, which is based on the
number of erroneous deviations between classes k and k_n.

The parameters Λ may be estimated by:

• minimum error rate training through Generalized Probabilistic Descent, which yields an
iterative solution;

• a modification thereof combined with parabolic smoothing, which yields a closed form
solution;

• a third method based on least squares, which again yields a closed form solution.

For the GPD method, the minimizing of the smoothed empirical error rate is
based on expression B4. The smoothed misclassification risk is given by equation B5, and
the average rivalry by equation B6.

The smoothed empirical error rate is minimized through B7. Herein, l is a
loss function that for straightforward calculations must be differentiable. Rivalry is given
by equation B8, wherein E indicates the number of errors. Average rivalry is given through
the summing in equation B9. A smoothed misclassification risk is expressed by equation B10
that behaves like a sigmoid function. For R_n → -∞, l becomes zero; for R_n → +∞, the limiting
value is l = 1. Herein A, B are scaling constants greater than zero. Differentiating with respect to Λ yields
expression B11, wherein the vector Λ^(I) is given by expression B12 and the eventual outcome
by expression B13.

The invention also provides a closed form solution for finding the
discriminative model combination DMC. The solution is to minimize the distance between on
the one hand the discriminant function and on the other hand the ideal discriminant function
E(k_n,k) in a least squares method. The basic expression is given by equation B14. Herein,
Λ = Q^{-1} P, wherein Q is a matrix with elements Q_{i,j} given by equation B15. Furthermore, P is
a vector with elements P_j given by equation B16. Now, the empirical error rate has been
given earlier in equation B3. For calculatory reasons this is approximated by a smoothed
empirical error rate as expressed by equation B17. Herein, an indication is given on the
number of errors between k and k_n through using a sigmoid function S or a similarly useful
function. A useful form is S(x) = {(x+B)/(A+B)}^2, wherein -B<x<A and -B<0<A. For
higher values of x, S=1, and for lower values S=0. This parabola has proved to be useful.
Various other second degree curves have been found useful. The relevant rivals must now lie
in the central and parabolically curved interval of S. Now, finally, a normalization constraint
is added for Λ according to expression B18.
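The parabolic smoothing function can be written down directly from the form given above. In this Python sketch the function name is an assumption, and the parameters `a` and `b` stand for the constants A and B of the text.

```python
def parabolic_step(x, a, b):
    """S(x) = ((x + B) / (A + B))**2 on -B < x < A; 0 below and 1 above.

    Requires a > 0 and b > 0, matching the condition -B < 0 < A.
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must both be positive")
    if x <= -b:
        return 0.0
    if x >= a:
        return 1.0
    return ((x + b) / (a + b)) ** 2

# Hypothetical constants; the curve rises from 0 at x = -2 to 1 at x = 2.
values = [parabolic_step(x, a=2.0, b=2.0) for x in (-3.0, -2.0, 0.0, 1.0, 2.0, 5.0)]
```

Because S is a second degree polynomial inside the interval, a criterion built from it stays quadratic in the coefficients for rivals that lie in that interval, which is what makes a closed solution possible.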

The second criterion is solved according to a matrix equation (a, Λ^tr)^tr = Q'^{-1} P',
wherein an additional row and column have been supplemented to matrix Q'
for normalization reasons, according to Q'_{0,0} = 0; Q'_{0,j} = 1, Q'_{j,0} = 1/2(A+B)^2. The general
element of correlation matrix Q' has been given in equation B19. Note that the closed
solution is rendered possible through the smoothed step function S. Furthermore, vector P'
likewise gets a normalizing element P'_0 = 1, whereas its general element is given by
expression B20.

Experiments have been done with various M-gram language models, such
as bigram, trigram, fourgram or tetragram models, and with various acoustic models, such as
word-internal-triphone, cross-word-triphone and pentaphone models. Generally, the automatic DMC
procedure performs as well as non-automatic fine tuning using
the same set of submodels. However, the addition of extra submodels according to the
automatic procedure of the invention made it possible to decrease the number of errors by about 8%.
This is considered a significant step forward in the refined art of speech recognition. It is
expected that the invention could provide similarly excellent results for recognizing other
types of patterns, such as signatures, handwriting, scene analysis, and the like, given the
availability of appropriate submodels. Other submodels used for broadcast recognition
included MLLR adaptation, unigram, distance-1 bigram (wherein an intermediate element is
considered as don't care), pentaphones and WSJ models. In this environment, raising the
number of submodels in the automatic procedure of the invention also lowered the number
of errors by a significant amount of 8-13%.

Figure 1 shows an overall flow chart of a method according to the
invention. In block 20 the training is started on a set of training data or patterns that is
provided in block 22. The start, as far as necessary, claims the required software and hardware
facilities; in particular, the various submodels and the identity of the various patterns are also
provided. For simplicity, the number of submodels has been limited to 2, but the number
may be higher. In parallel blocks 24 and 26, the scores are determined for the individual
submodels. In block 28 the log-lin combination of the various submodels is executed and
normalized. In block 30 the machine optimizing of vector Λ in view of the lowest attainable
error rate is executed. Note that vector Λ may have one or more zero-valued components to
signal that the associated submodel or submodels would bring about no improvement at all.

Next, the vector Λ and the various applicable submodels will be used for
recognizing target data, as shown in the right half of the Figure. The training at left and the
usage at right may be executed remote from each other both in time and in space; for
example, a person could have a machine trained to that person's voice at a provider's
premises. This might require extra data processing facilities. Later, the machine so trained
may be used in a household or automobile environment, or other. Thus, blocks 40-46 have
corresponding blocks at left.

In block 48 the scorings from the various submodels are log-lin combined,
using the various components of vector Λ that had been found in the training. Finally, in
block 50 the target data are classified using the results from block 48. In block 52, the
procedure is stopped when ready.
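The recognition half of the flow chart can be sketched in Python as follows. The sketch relies on two observations that follow from the text: a normalization that is the same for all classes does not change which class scores highest, and a submodel with a zero-valued component of the weight vector need not be evaluated. The function name, the callable submodels and the toy values are hypothetical.

```python
import math

def recognize(candidates, submodels, weights):
    """Pick the class with the highest weighted sum of submodel log scores.

    candidates: list of class labels
    submodels:  list of callables, each returning p_j(k|x) for a class label
    weights:    trained coefficients lambda_j, one per submodel
    """
    # Skip submodels whose trained weight is zero; they cannot affect the result.
    active = [(w, m) for w, m in zip(weights, submodels) if w != 0.0]

    def score(k):
        return sum(w * math.log(m(k)) for w, m in active)

    return max(candidates, key=score)

# Hypothetical submodel scores for one fixed target observation.
acoustic = {"stop": 0.4, "start": 0.6}.__getitem__
language = {"stop": 0.8, "start": 0.2}.__getitem__

def unused(k):
    raise RuntimeError("zero-weight submodel should not be evaluated")

result = recognize(["stop", "start"], [acoustic, language, unused], [1.0, 1.0, 0.0])
```

With both active submodels weighted equally, the language score outweighs the acoustic preference in this toy case; with the language weight set to zero the acoustic preference decides.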

Figure 2 shows a comprehensive system for practising the invention. The
necessary facilities may be mapped on standard hardware, or on a special purpose machine.
Item 60 is an appropriate pickup, such as a voice recorder or a two-dimensional optical
scanner, together with A/D facilities and quality enhancing preprocessing if necessary. Block
64 represents the processing that applies programs from program memory 66 on data that
may arrive from pickup 60, or from data storage 62, where they may have been stored
permanently or transiently, after forwarding from pickup 60. Line 70 may receive user
control signals, such as start/stop, and possibly training-supportive signals, such as for
definitively disabling a non-contributory submodel.

Block 68 renders the recognition result usable, such as by tabulating,
printing, addressing a dialog structure for retrieving a suitable speech answer, or selecting a
suitable output control signal. Block 72 symbolizes the use of the recognized speech, such as
outputting a speech riposte, opening a gate for a recognized person, selecting a path in a
sorting machine, and the like.
CLAIMS:



1. A method for recognizing a pattern that comprises a set of physical
stimuli, said method comprising the steps of:

- providing a set of training observations and through applying a plurality of association
models ascertaining various measuring values p_j(k | x), j = 1...M, that each pertain to
assigning a particular training observation to one or more associated pattern classes;

- setting up a log/linear association distribution by combining all association models of the
plurality according to respective weight factors, and joining thereto a normalization quantity
to produce a compound association distribution;

- optimizing said weight factors for thereby minimizing a detected error rate of the actual
assigning to said compound distribution;

- recognizing target observations representing a target pattern with the help of said compound
distribution.



2. A method for modelling an association distribution for patterns that
comprise a plurality of physical stimuli, said method comprising the steps of:

- providing a set of training observations and through applying a plurality of association
models ascertaining various measuring values p_j(k | x), j = 1...M, that each pertain to
assigning a particular training observation to one or more associated pattern classes;

- setting up a log/linear association distribution by combining all association models of the
plurality according to respective weight factors, and joining thereto a normalization quantity
to produce a compound association distribution;

- optimizing said weight factors for thereby minimizing a detected error rate of the actual
assigning to said compound distribution.



3. A method for recognizing a pattern that comprises a set of physical
stimuli, said method comprising the steps of:

- receiving a plurality of association models indicating various measuring values p_j(k | x),
j = 1...M, that each pertain to assigning an observation to one or more associated pattern
classes, as being combined in a log/linear association distribution according to respective
weight factors, and joined thereto a normalization quantity to produce a compound
association distribution;

- optimizing said weight factors for thereby minimizing a detected error rate of the actual
assigning to said compound distribution;

- recognizing target observations representing a target pattern with the help of said compound
distribution.



4. A method as claimed in Claim 1, wherein said association model is a
probability model, and said association distribution is a probability model for associating.

5. A method as claimed in Claim 1, wherein said optimizing is effected
through minimizing a training error rate in an iterative manner, wherein said error rate is
expressed in a continuous and differentiable manner as a function of rivalry values of
non-optimum assigning.

6. A method as claimed in Claim 1, wherein said optimizing is effected in a
least squares method between an actual discriminant function as resulting from said
compound distribution and an ideal discriminant function, as expressed on the basis of an
error rate, whilst expressing the weight vector Λ in a closed expression as Λ = Q^{-1} P,
wherein:

Q: autocorrelation matrix of the discriminant functions of the various models

P: correlation vector between the error rate and the discriminant functions.

7. A method as claimed in Claim 6, wherein the empirical error rate is
smoothed through representing it as a second degree curve in an interval (-B,A), whilst
normalizing Λ through a constraint Σ_j λ_j = 1, whilst furthermore expressing the weight
vector Λ in a closed expression according to Λ = Q'^{-1} P', wherein Q' is an extended
autocorrelation matrix through a normalization item added, and P' an extended correlation
vector through a further normalization item added.

8. A method as claimed in Claim 4, and applied to speech recognition,
wherein said probability models comprise one or more of the set of:

as language models: bigram, trigram, fourgram;

as acoustic models: word-internal triphones, cross-word triphones, maximum likelihood
linear regression adaptation models;

as additional models: unigram, distance-1 bigram (the middle element being assumed don't
care), pentaphones.

9. A system being arranged for executing a method as claimed in Claim 1
for recognizing a pattern that comprises a set of physical stimuli, said system comprising:

- pickup means for receiving a body of training observations and being interconnected to first
processing means for through a plurality of stored association models ascertaining various
measuring values p_j(k | x), j = 1...M, that each pertain to assigning a particular observation
to one or more classes of patterns;

- second processing means fed by said first processing means and being arranged for setting
up a log/linear association distribution by combining all association models of the plurality
according to respective weight factors, and for joining thereto a normalization quantity to
produce a compound association distribution;

- third processing means fed by said second processing means for optimizing said weight
factors for thereby minimizing a detected error rate of the actual assigning to said compound
distribution;

- recognizing means fed by said third processing means for recognizing target observations
representing a target pattern with the help of said compound distribution.